Using deep bidirectional recurrent neural networks for prosodic-target prediction in a unit-selection text-to-speech system
Abstract
Deeply-stacked Bidirectional Recurrent Neural Networks (BiRNNs) are able to capture complex short- and long-term context dependencies between predictors and targets: their recurrent hidden layers accumulate information from all preceding and future observations, so each target prediction depends non-linearly on the entire input sequence. This property makes them attractive for tasks such as the prediction of prosodic contours for text-to-speech systems, where the surface prosody can result from the interaction between local and non-local features. Although previous work has demonstrated that they attain state-of-the-art performance for this task within a parametric synthesis framework, their use within unit-selection synthesis systems remains unexplored. In this work we deploy this class of models within a unit-selection system, investigate their effect on the outcome of the unit search, and perceptually evaluate it against the baseline (decision-tree-based) approach.
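As a rough illustration of the model class described above, a deeply-stacked BiRNN regressor for per-unit prosodic targets could be sketched in PyTorch as follows; the feature dimension, choice of LSTM cells, layer count, and four-dimensional target (e.g. per-unit F0 and duration statistics) are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class DeepBiRNNProsodyPredictor(nn.Module):
    """Stacked bidirectional RNN mapping per-unit linguistic features
    to prosodic targets (illustrative sketch, not the paper's exact setup)."""
    def __init__(self, feat_dim=300, hidden_dim=128, num_layers=3, target_dim=4):
        super().__init__()
        # Stacked bidirectional LSTM: at every position the concatenated
        # forward/backward states summarise the whole left and right context.
        self.birnn = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        # Regression head from the concatenated states to the prosodic targets.
        self.out = nn.Linear(2 * hidden_dim, target_dim)

    def forward(self, x):
        # x: (batch, seq_len, feat_dim) linguistic features, one vector per unit
        h, _ = self.birnn(x)   # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)     # (batch, seq_len, target_dim)

# Example: prosodic targets for a batch of 2 utterances of 50 units each
model = DeepBiRNNProsodyPredictor()
print(model(torch.randn(2, 50, 300)).shape)  # torch.Size([2, 50, 4])

In a unit-selection setting such predicted targets would feed the target cost of the unit search rather than a vocoder, which is the deployment investigated in the paper.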
Similar resources
Articulatory movement prediction using deep bidirectional long short-term memory based recurrent neural networks and word/phone embeddings
Automatic prediction of articulatory movements from speech or text can be beneficial for many applications such as speech recognition and synthesis. A recent approach has reported state-of-the-art performance in speech-to-articulatory prediction using feed-forward neural networks. In this paper, we investigate the feasibility of using bidirectional long short-term memory based recurrent neural n...
The USTC System for Blizzard Challenge 2016
This paper introduces the details of the speech synthesis entry developed by the USTC team for Blizzard Challenge 2016. A 5-hour corpus of a highly expressive children's audiobook was released this year to the participants. A hidden Markov model (HMM)-based unit selection system was built for the task. In addition, we utilized deep neural networks to improve the performance of our system, in bot...
Global Syllable Vectors for Building TTS Front-End with Deep Learning
Recent vector space representations of words have succeeded in capturing syntactic and semantic regularities. In the context of text-to-speech (TTS) synthesis, a front-end is a key component for extracting multi-level linguistic features from text, where the syllable acts as a link between low- and high-level features. This paper describes the use of global syllable vectors as features to build a fro...
Multi-Task Learning for Prosodic Structure Generation Using BLSTM RNN with Structured Output Layer
Prosodic structure generation from text plays an important role in Chinese text-to-speech (TTS) synthesis and greatly influences the naturalness and intelligibility of the synthesized speech. This paper proposes a multi-task learning method for prosodic structure generation using a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) and a structured output layer (SOL). Unl...
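As a loose sketch of the multi-task idea mentioned in the blurb above, a shared BLSTM encoder with two boundary classifiers, where the lower-level prosodic-word posterior is fed into the prosodic-phrase head, could look like the following; the feature dimension, layer sizes, and two-level task setup are illustrative assumptions, not the configuration of the cited structured-output-layer model.

import torch
import torch.nn as nn

class MultiTaskBLSTMProsody(nn.Module):
    """Shared BLSTM encoder with stacked boundary classifiers; the lower-level
    prediction conditions the higher-level one (illustrative sketch only)."""
    def __init__(self, feat_dim=200, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True)
        # Task 1: prosodic-word boundary; Task 2: prosodic-phrase boundary,
        # conditioned on the task-1 posterior at the same position.
        self.pw_head = nn.Linear(2 * hidden_dim, 2)
        self.pph_head = nn.Linear(2 * hidden_dim + 2, 2)

    def forward(self, x):
        # x: (batch, seq_len, feat_dim) per-syllable/word linguistic features
        h, _ = self.encoder(x)
        pw_logits = self.pw_head(h)
        pw_post = torch.softmax(pw_logits, dim=-1)
        pph_logits = self.pph_head(torch.cat([h, pw_post], dim=-1))
        return pw_logits, pph_logits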
Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units
Convolutional and bidirectional recurrent neural networks have achieved considerable performance gains as acoustic models in automatic speech recognition in recent years. Recent architectures unify long short-term memory, gated recurrent units, and convolutional neural networks by stacking these different network types on top of each other and providing short- and long-term features to different ...
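To give a concrete picture of stacking convolutional and bidirectional recurrent layers in an acoustic model, a generic sketch is shown below; it uses a plain Conv1d front-end feeding a bidirectional GRU and a per-frame senone classifier, which is a simplified stand-in rather than the gated recurrent convolutional units proposed in the cited work, and all dimensions are placeholder assumptions.

import torch
import torch.nn as nn

class ConvBiGRUAcousticModel(nn.Module):
    """Convolutional front-end stacked under a bidirectional GRU and a
    per-frame senone classifier (generic CNN+BiRNN sketch, not the cited model)."""
    def __init__(self, n_mels=40, n_senones=2000, hidden_dim=256):
        super().__init__()
        # Local (short-term) feature extraction along the time axis.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Bidirectional recurrence captures long-term context in both directions.
        self.bigru = nn.GRU(128, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, n_senones)

    def forward(self, feats):
        # feats: (batch, frames, n_mels) filterbank features
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        h, _ = self.bigru(x)
        return self.out(h)  # (batch, frames, n_senones) frame-level logits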
Publication date: 2015